Q1

library(tidyverse)
setwd("C:/Users/User/Documents/GitHub/biomarkers-group9")
Warning: The working directory was changed to C:/Users/User/Documents/GitHub/biomarkers-group9 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
## 1. Get data
# get names
var_names <- read_csv('data/biomarker-raw.csv', 
                     col_names = F, 
                     n_max = 2, 
                     col_select = -(1:2)) %>%
  t() %>%
  as_tibble() %>%
  rename(name = V1, 
         abbreviation = V2) %>%
  na.omit()
Rows: 2 Columns: 1318
── Column specification ─────────────────────────────────────────
Delimiter: ","
chr (1318): X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
# function for trimming outliers (good idea??)
trim <- function(x, .at){
  x[abs(x) > .at] <- sign(x[abs(x) > .at])*.at
  return(x)
}
# read in data
biomarker_raw <- read_csv('data/biomarker-raw.csv', 
         skip = 2,
         col_select = -2L,
         col_names = c('group', 
                       'empty',
                       pull(var_names, abbreviation),
                       'ados'),
         na = c('-', '')) %>%
  filter(!is.na(group)) %>%
  # reorder columns
  select(group, ados, everything())
Rows: 155 Columns: 1319
── Column specification ─────────────────────────────────────────
Delimiter: ","
chr    (1): group
dbl (1318): CHIP, CEBPB, NSE, PIAS4, IL-10 Ra, STAT3, IRF1, c...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# export as r binary
save(list = 'biomarker_raw', 
     file = 'data/biomarker-raw.RData')

biomarker_raw
biomarker_clean

## 3. Plot the distribution for single variables
set.seed(1234)
# randomly pick 5 column positions
n2 <- sample(2:1319, size = 5)
## Select the sampled columns
## For raw data
raw_var <- biomarker_raw %>% select(all_of(n2))

## For clean data
clean_var <- biomarker_clean %>% select(all_of(n2))

## Pull the names
col.name <- colnames(raw_var)

## plot the histograms for single variable distribution
## (one row of two panels per variable: raw on the left, clean on the right)
for (i in seq_along(col.name)) {
  par(mfrow = c(1, 2))
  ## Raw
  hist(raw_var[[i]], main = paste(col.name[i], "(raw)"), xlab = paste(col.name[i], "(raw)"))
  ## Clean
  hist(clean_var[[i]], main = paste(col.name[i], "(clean)"), xlab = paste(col.name[i], "(clean)"))
}

In this part, we draw histograms to compare the distributions of the data before and after the log transformation, and then describe how the transformation changes those distributions.

In step 2, we take the mean of each of 650 randomly selected variables, before and after the log transformation, and compare the two distributions of means as a summary of the whole data set. In the two histograms above, the distribution of the raw means is highly right-skewed, and its range remains very large even after we restrict the x-axis with xlim. After the log transformation, the range of the means is much smaller (from about -0.06 to 0.05), the distribution is centered near x = 0, and it is far less right-skewed than the distribution of the raw means.
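The near-zero means of the clean data can be reproduced in a small sketch. It uses simulated log-normal data rather than the biomarker file, and it assumes the clean data were produced by a log10 transformation followed by per-column centering and scaling and then `trim()` at 3. That pipeline is an assumption, since the chunk that builds `biomarker_clean` is not shown here. The object names `raw`, `clean`, `raw_means`, and `clean_means` are hypothetical.

```r
# Simulated stand-in for the protein columns (NOT the real biomarker data):
# 155 samples x 50 proteins, log-normal with protein-specific scales.
set.seed(1)
n_samples <- 155
n_proteins <- 50
meanlog <- seq(5, 10, length.out = n_proteins)
raw <- sapply(meanlog, function(m) rlnorm(n_samples, meanlog = m, sdlog = 1))

# trim() as defined earlier in the notebook: clamp values to [-.at, .at]
trim <- function(x, .at) {
  x[abs(x) > .at] <- sign(x[abs(x) > .at]) * .at
  x
}

# Assumed cleaning steps: log10, center and scale each column, trim at 3
clean <- apply(raw, 2, function(col) trim(as.numeric(scale(log10(col))), .at = 3))

raw_means <- colMeans(raw)
clean_means <- colMeans(clean)

range(raw_means)    # differs widely across columns
range(clean_means)  # narrow interval around 0

# Side-by-side histograms of the column means
par(mfrow = c(1, 2))
hist(raw_means, main = "Column means (raw)", xlab = "mean")
hist(clean_means, main = "Column means (clean)", xlab = "mean")
```

Under these assumptions the means sit near 0 mainly because each column is centered, and trimming is the only step that can move them slightly away from exactly 0.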

In step 3, we randomly select 5 proteins to see how the log transformation changes the distribution of a single variable. Much like the distributions of means above, the first four raw distributions are right-skewed with large ranges, and the last variable, CHL1, is also slightly right-skewed. After the log transformation, most of the distributions look much closer to a standard normal distribution, centered at x = 0 with a range of about -3 to 3. The exception is hnRNP K, whose transformed distribution is still right-skewed because its original distribution was so strongly skewed.
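Skewness can also be checked numerically instead of by eye. The sketch below defines a hypothetical helper, `skewness()`, and applies it to one simulated right-skewed variable before and after a log10 transformation. The data are simulated for illustration and are not one of the five proteins above.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# Values near 0 suggest symmetry; positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Simulated right-skewed "protein" (hypothetical data, not from the study)
set.seed(1234)
protein_raw <- rlnorm(155, meanlog = 7, sdlog = 1)
protein_log <- log10(protein_raw)

skew_raw <- skewness(protein_raw)
skew_log <- skewness(protein_log)

# Compare the two values side by side
c(raw = skew_raw, log10 = skew_log)
```

The same helper could be applied to each column of `raw_var` and `clean_var` to rank the selected proteins by how much skew remains after the transformation.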

Therefore, the log transformation moves our data from highly skewed distributions toward approximately normal ones and also reduces the range of the data. This has several advantages. First, with a smaller range, the means and variances of the different variables fall within a narrow interval, which makes them easier to inspect and to work with.
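To illustrate how the transformation brings variables onto comparable scales, here is a minimal sketch with two simulated variables whose raw values differ by orders of magnitude. The variables `protein_a` and `protein_b` and the helper `summarise_scale()` are hypothetical and are not part of the original analysis.

```r
set.seed(42)
# Two hypothetical proteins measured on very different scales
protein_a <- rlnorm(155, meanlog = 5, sdlog = 0.5)
protein_b <- rlnorm(155, meanlog = 10, sdlog = 0.5)

# Mean, standard deviation, and width of the observed range
summarise_scale <- function(x) {
  c(mean = mean(x), sd = sd(x), width = diff(range(x)))
}

raw_summary <- rbind(a = summarise_scale(protein_a),
                     b = summarise_scale(protein_b))
log_summary <- rbind(a = summarise_scale(log10(protein_a)),
                     b = summarise_scale(log10(protein_b)))

raw_summary
log_summary

# How much larger is the spread of b than the spread of a?
sd_ratio_raw <- raw_summary["b", "sd"] / raw_summary["a", "sd"]
sd_ratio_log <- log_summary["b", "sd"] / log_summary["a", "sd"]
c(raw = sd_ratio_raw, log10 = sd_ratio_log)
```

On the raw scale the spread of `protein_b` dwarfs that of `protein_a`, while after log10 the two standard deviations are of similar size.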

More importantly, if we later fit a regression model to these data, the raw values have some disadvantages. When strongly skewed variables are related to the response nonlinearly, the model's errors may be skewed as well. We want the prediction error to be as small as possible without overfitting the model. Overfitting occurs when a model has so many predictors that it fits the noise in the training data and does not generalize well enough to make valid predictions on new data. The transformed data may reduce the influence of extreme values on the fit, which could lower both the risk of overfitting and the prediction error. Thus, transforming one or more variables can improve the fit of the model by bringing the distribution of the features closer to a bell-shaped normal curve.
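The effect on model fit can be sketched with simulated data. The example assumes a response that depends linearly on the log10 of a right-skewed predictor, so the advantage of the transformed predictor is built in by construction. It only illustrates the mechanism and says nothing about the biomarker data themselves. All names (`x_raw`, `y`, `fit_raw`, `fit_log`) are hypothetical.

```r
set.seed(2022)
n <- 155

# Right-skewed predictor and a response that is linear in log10(x_raw)
x_raw <- rlnorm(n, meanlog = 7, sdlog = 1)
y <- 2 * log10(x_raw) + rnorm(n, sd = 0.5)

# Same response, predictor on the raw scale vs. the log10 scale
fit_raw <- lm(y ~ x_raw)
fit_log <- lm(y ~ log10(x_raw))

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log10 = r2_log)

# Residual diagnostics for both fits
par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "Raw predictor",
     xlab = "fitted", ylab = "residual")
plot(fitted(fit_log), resid(fit_log), main = "log10 predictor",
     xlab = "fitted", ylab = "residual")
```

In this setup the gain comes from making the relationship linear and from shrinking the leverage of the few very large raw values. Linear regression does not itself require normally distributed predictors, so on real data the two fits should be compared on held-out data rather than assumed to differ.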
